Abstract:Graph similarity join has extensive use in the field of data mining, especially in data pre-processing, it could be applied to data cleaning, near duplicate detection, etc. Thus, it is of great importance to study graph similarity join. Graph similarit join based on edit distance constraints is studied, that is, all the edit distances in the return pair of graphs are no larger than a given threshold. Based on MapReduce programming model, an algorithm named MGSJoin is proposed with the ″filtering-verification″ framework, and it relies on graph signatures of path-based q-grams for filtering out non-promising candidates, i.e. count filtering.With the potential issue of too many key-value pairs, Bloom Filter is introduced to improve the algorithm and BMGSJoin is designed. The improvement of efficiency and scalability by the proposed algorithm is demonstrated by extensive experimental results, and it may meet the current challenges of big data mining and analysis.
[1] Yan X F, Han J W. gSpan: Graph-Based Substructure Pattern Mining // Proc of the IEEE International Conference on Data Mining. Maebashi, Japan, 2002: 721-724 [2] Yan X F, Yu P S, Han J W. Graph Indexing: A Frequent Structure-Based Approach // Proc of the ACM SIGMOD International Conference on Management of Data. Paris, France, 2004: 335-346 [3] He H H, Singh A K. Closure-Tree: An Index Structure for Graph Queries // Proc of the 22nd International Conference on Data Engineering. Atlanta, USA, 2006: 38.1-38.12 [4] Williams D W, Huan J, Wang W. Graph Database Indexing Using Structured Graph Decomposition // Proc of the 23rd International Conference on Data Engineering. Istanbul, Turkey, 2007: 976-985 [5] Tian Y Y, Patel J M. Tale: A Tool for Approximate Large Graph Matching // Proc of the 24th IEEE International Conference on Data Engineering. Cancun, Mexico, 2008: 963-972 [6] Yan X F, Yu P S, Han J W. Substructure Similarity Search in Graph Databases // Proc of the ACM SIGMOD International Conference on Management of Data. Baltimore, USA, 2005: 766-777 [7] Zhao X, Xiao C, Lin X M, et al. Efficient Graph Similarity Joins with Edit Distance Constraints // Proc of the 28th IEEE International Conference on Data Engineering. Washington, USA, 2012: 834-845 [8] Sanfeliu A, Fu K S. A Distance Measure between Attributed Relational Graphs for Pattern Recognition. IEEE Trans on Systems, Man and Cybernetics, 1983, SMC-13(3): 353-362 [9] Bunke H, Allermann G. Inexact Graph Matching for Structural Pattern Recognition. Pattern Recognition Letters, 1983, 1(4): 245-253 [10] Zeng Z P, Tung A K H, Wang J Y, et al. Comparing Stars: On Approximating Graph Edit Distance // Proc the 35th International Conference on Very Large Data Bases. Lyon, France, 2009: 25-36 [11] Dean J, Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 2008, 51(1): 107-113 [12] Deng D, Li G L, Hao S, et al. MassJoin: A MapReduce-Based Method for Scalable String Similarity Joins // Proc of the 30th IEEE International Conference on Data Engineering. Chicago, USA, 2014: 340-351 [13] Bloom B H. Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM, 1970, 13(7): 422-426 [14] Broder A, Mitzenmacher M. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 2004, 1(4): 485-509 [15] Koutris P. Bloom Filters in Distributed Query Execution [EB/OL]. [2014-01-23]. http://courses.cs.washington.edu/courses/cse544/11wi/projects/koutris.pdf [16] Cohen S, Matias Y. Spectral Bloom Filters // Proc of the ACM SIGMOD International Conference on Management of Data. San Diego, USA, 2003: 241-252 [17] Lin X M, Wang W. Set and String Similarity Queries: A Survey. Chinese Journal of Computers, 2011, 34(10): 1853-1862 (in Chinese) (林学民, 王 炜.集合和字符串的相似度查询.计算机学报, 2011, 34(10): 1853-1862 [18] Pang J, Gu Y, Xu J, et al. Research Advance on Similarity Join Queries. Journal of Frontiers of Computer Science and Technology, 2013, 7(1): 1-13 (in Chinese) (庞 俊,谷 峪,许 嘉,等.相似性连接查询技术研究进展.计算机科学与探索, 2013, 7(1): 1-13) [19] Wang G, Wang B, Yang X C, et al. Efficiently Indexing Large Sparse Graphs for Similarity Search. IEEE Trans on Knowledge and Data Engineering, 2012, 24(3): 440-451 [20] Cohen J. Graph Twiddling in a MapReduce World. Computing in Science & Engineering, 2009, 11(4): 29-41 [21] Lattanzi S, Moseley B, Suri S, et al. Filtering: A Method for Solving Graph Problems in MapReduce // Proc of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures. San Jose, USA, 2011: 85-94 [22] Suri S, Vassilvitskii S. Counting Triangles and the Curse of the Last Reducer // Proc of the 20th International Conference on World Wide Web. Hyderabad, India, 2011: 607-614 [23] Kang U, Tsourakakis C E, Appel A P, et al. HADI: Mining Radii of Large Graphs. ACM Transactions on Knowledge Discovery from Data, 2011, 5(2): 8.1-8.24 [24] Bahmani B, Chakrabarti K, Xin D. Fast Personalized Pagerank on MapRreduce // Proc of the ACM SIGMOD International Conference on Management of data. Athens, Greece, 2011: 973-984 [25] Kang U, Tong H H, Sun J M, et al. Gbase: A Scalable and General Graph Management System // Proc of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego, USA, 2011: 1091-1099 [26] Kang U, Tsourakakis C E, Faloutsos C. Pegasus: A Peta-Scale Graph Mining System Implementation and Observations // Proc of the 9th IEEE International Conference on Data Mining. Miami, USA, 2009: 229-238 [27] Bahmani B, Kumar R, Vassilvitskii S. Densest Subgraph in Streaming and MapReduce. Proceedings of the VLDB Endowment, 2012, 5(5): 454-465 [28] Afrati F N, Fotakis D, Ullman J D. Enumerating Subgraph Instances Using Map-Reduce // Proc of the 29th IEEE International Conference on Data Engineering. Brisbane, Australia, 2013: 62-73